This report explores a dataset containing subjective quality assessment scores and the contributing physicochemical test measurements for approximately 1,600 red wines of the Vinho Verde variety.
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
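The output above looks like the result of `dim()`, `str()` and `summary()` on the data frame. A minimal sketch of those calls follows, using only four columns and the first ten rows printed in the `str()` output. The object name `wine` and the file name in the comment are assumptions, not taken from the report.

```r
# With the full data one would read the file first, for example:
# wine <- read.csv("wineQualityReds.csv")  # hypothetical file name

# First ten rows of four columns, copied from the str() output above
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine)      # rows and columns
str(wine)      # column types and leading values
summary(wine)  # five-number summary plus mean for each column
```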
Our dataset consists of 12 variables, with approximately 1,600 observations.
The distribution of perceived quality is approximately normal, with scores (out of 10) ranging from 3 to 8.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
##
## 7.2 7.1 7.8 7.5 7 7.7 6.8 7.6 8.2 7.3 7.4 7.9 8 8.3 6.9
## 67 57 53 52 50 49 46 46 45 44 44 42 42 40 38
## 6.6 8.8 8.9 9.1 6.7 8.6 8.1 8.4 9 9.9 6.4 8.7 10 9.3 10.4
## 37 34 33 29 28 27 26 26 26 26 25 24 23 22 21
## 6.2 8.5 10.2 6.5 9.4 9.6 6.1 9.2 9.8 5.6
## 20 19 19 17 17 17 16 16 15 14
The majority of the wines have tartaric acid concentrations from 7g to 10g / dm^3; median 7.9g / dm^3 and mean 8.32g / dm^3.
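The frequency listing above appears to be a table of values sorted by count. A sketch of how such a listing can be produced, again on the first ten fixed acidity values from the `str()` output (the variable names are illustrative):

```r
# First ten fixed acidity values from the str() output
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

summary(fixed_acidity)

# Count each distinct value, then order the counts from most to least frequent
counts <- sort(table(fixed_acidity), decreasing = TRUE)
head(counts)
```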
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most wines have concentrations of acetic acid from 0.3g to 0.7g / dm^3; median 0.5200g / dm^3 and mean 0.5278g / dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## FALSE TRUE
## 1598 1
Most wines have citric acid concentrations between 0.09g and 0.45g / dm^3. The distribution is skewed right with spikes in observations at 0g, 0.24g and 0.49g. Citric acid is a flavour-enhancing additive. We can see from the histogram that a lot of producers choose not to add it. The 0.24g and 0.49g concentrations are probably just a result of people rounding additions. One producer seems to have doubled down and added a whopping 1g / dm^3.
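The zero additions and the single 1g / dm^3 wine can be counted with logical tests. This is a sketch on the first ten citric acid values from the `str()` output; the exact condition behind the FALSE/TRUE table above is not shown in the report, so `>= 1` is an assumption.

```r
# First ten citric acid values from the str() output
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

table(citric == 0)   # wines with no citric acid
table(citric >= 1)   # wines at or above 1 g / dm^3 (assumed condition)
```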
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The vast majority of wines contain between 1g and 3g / dm^3 residual sugar (post fermentation). There is quite a long tail on the histogram with a few outliers containing more than 10g / dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Sodium chloride (salt) is mainly found in concentrations between 0.05g and 0.1g / dm^3. The tail end of the distribution is similar to that of residual sugar. It might be interesting to see if these outliers represent the same observations.
##
## FALSE TRUE
## 1570 29
##
## FALSE TRUE
## 1577 22
## [1] fixed.acidity volatile.acidity citric.acid
## [4] residual.sugar chlorides free.sulfur.dioxide
## [7] total.sulfur.dioxide density pH
## [10] sulphates alcohol quality
## <0 rows> (or 0-length row.names)
There appears to be no direct relationship between residual sugar and chlorides: no wine is an outlier in both.
The distributions of sulphates, free sulfur dioxide and total sulfur dioxide are quite similar. As with my previous inquiry, I am curious to see if the outliers come from the same observations.
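A sketch of this kind of check: flag the outliers in each variable, then keep only the rows flagged in both. The cutoffs below are hypothetical, chosen to suit the ten-row sample; the report does not state the thresholds it used.

```r
# First ten rows of the two columns, from the str() output
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                     0.065, 0.073, 0.071)
)

# Hypothetical cutoffs, for illustration only
sugar_cutoff    <- 6
chloride_cutoff <- 0.09

sugar_out    <- wine$residual.sugar > sugar_cutoff
chloride_out <- wine$chlorides > chloride_cutoff

table(sugar_out)
table(chloride_out)

# Rows that are outliers in both variables; zero rows means no overlap
both <- wine[sugar_out & chloride_out, ]
nrow(both)
```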
##
## FALSE TRUE
## 1591 8
##
## FALSE TRUE
## 1595 4
##
## FALSE TRUE
## 1597 2
## [1] fixed.acidity volatile.acidity citric.acid
## [4] residual.sugar chlorides free.sulfur.dioxide
## [7] total.sulfur.dioxide density pH
## [10] sulphates alcohol quality
## <0 rows> (or 0-length row.names)
Once again, no observation is an outlier in all three variables.
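With more than two variables, the same check can be written once for any number of columns. The sketch below flags values above an upper quantile in each column and combines the flags with `Reduce()`. The 0.9 quantile is an assumed cutoff for the ten-row sample, not the one used in the report.

```r
# First ten rows of the three columns, from the str() output
wine <- data.frame(
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# One logical vector per column: TRUE where the value exceeds that
# column's 0.9 quantile (assumed cutoff)
flags <- lapply(wine, function(x) x > quantile(x, 0.9))
sapply(flags, sum)

# TRUE only for rows flagged in every column
in_all <- Reduce(`&`, flags)
wine[in_all, ]
```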
##
## 0.9972 0.9968 0.9976 0.998 0.9962 0.9978 0.9964 0.997 0.9994
## 36 35 35 29 28 26 25 24 24
## 0.9966 0.9982 0.9974 0.9984 0.9988 0.9986 0.9969 0.9973 0.9963
## 23 23 22 20 20 19 18 18 15
## 0.9955 0.9956 0.9958 0.9979 0.9959 0.996 0.9967 0.9971 0.9987
## 14 14 14 14 13 13 13 13 12
## 0.9996 0.99538 0.9965 0.995 0.9961 0.9981 0.9991 0.9998 1
## 12 11 11 10 10 10 10 10 10
## 1.0002 0.9948 0.9952 0.99572
## 10 9 9 9
For density, the most frequently observed values end in an even digit. This accounts for the spikes in the distribution.
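This can be verified directly. The sketch below takes the nine most frequent density values from the table above and tests whether the fourth decimal digit is even.

```r
# Nine most frequent density values, copied from the table above
density_values <- c(0.9972, 0.9968, 0.9976, 0.998, 0.9962,
                    0.9978, 0.9964, 0.997, 0.9994)

# Scale to whole numbers (rounding away floating point error),
# then take the last digit
fourth_decimal <- round(density_values * 1e4) %% 10

all(fourth_decimal %% 2 == 0)
```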
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Most pH values fall between 3.2 and 3.5.
##
## 9.5 9.4 9.8 9.2 10 10.5 9.3 9.6 11 9.7 9.9 10.9 10.1 10.2 10.8
## 139 103 78 72 67 67 59 59 59 54 49 49 47 46 42
## 10.4 11.2 10.3 11.3 11.4 9 11.5 11.8 10.6 10.7 11.1 9.1 11.7 12 12.5
## 41 36 33 32 32 30 30 29 28 27 27 23 23 21 21
## 11.9 12.8 11.6 12.1 12.4 12.2 12.3 12.7 12.9 14
## 20 17 15 13 13 12 12 9 9 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Vinho Verde wine (red variety) is mainly represented with alcohol volumes ranging from 9.5% to 11.5%; median 10.2% and mean 10.42%.
Most of the wines fall between 9% and 11% alcohol by volume, with gradually fewer wines at higher alcohol volumes. The majority of wines have a pH between 3.2 and 3.4. Fixed acidity is skewed to the right, with most wines containing concentrations of tartaric acid at 9g / dm^3 or less.
The dataset consists of 1,599 red wines (Vinho Verde) with 11 scientifically measured variables (numeric) and 1 sensory output variable (integer) in the form of a score (0-10).
Other observations:
The main features of the data set are volatile acidity, alcohol and quality score. I am interested in linking score to the other two variables. I also suspect that sulphate levels can partly help predict score.
Acidity levels and sulphate levels will probably influence the score. I think alcohol volume and volatile acidity are the biggest predictors.
I created new variables by categorising citric.acid and volatile.acidity into factor variables.
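A sketch of one way to build such a factor with `cut()`. The break points (the first and third quartiles of volatile acidity from the summary above) and the labels are assumptions; the report does not say which bins were used.

```r
# First ten volatile acidity values from the str() output
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Assumed bins: (0, 0.39], (0.39, 0.64], (0.64, Inf]
va_level <- cut(volatile_acidity,
                breaks = c(0, 0.39, 0.64, Inf),
                labels = c("Low", "Medium", "High"))

table(va_level)
```

For citric.acid, which contains zeros, the lowest break would need to sit below 0 (or `include.lowest = TRUE` would need to be set) so that those wines are not turned into `NA`.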
Citric acid was probably the most unusual with so many observations forgoing the addition. The spike around 0.5 for that same variable is also interesting.
## citric.acid fixed.acidity volatile.acidity
## citric.acid 1.000 0.672 -0.552
## fixed.acidity 0.672 1.000 -0.256
## volatile.acidity -0.552 -0.256 1.000
## residual.sugar 0.144 0.115 0.002
## chlorides 0.204 0.094 0.061
## sulphates 0.313 0.183 -0.261
## total.sulfur.dioxide 0.036 -0.113 0.076
## density 0.365 0.668 0.022
## alcohol 0.110 -0.062 -0.202
## quality 0.226 0.124 -0.391
## residual.sugar chlorides sulphates
## citric.acid 0.144 0.204 0.313
## fixed.acidity 0.115 0.094 0.183
## volatile.acidity 0.002 0.061 -0.261
## residual.sugar 1.000 0.056 0.006
## chlorides 0.056 1.000 0.371
## sulphates 0.006 0.371 1.000
## total.sulfur.dioxide 0.203 0.047 0.043
## density 0.355 0.201 0.149
## alcohol 0.042 -0.221 0.094
## quality 0.014 -0.129 0.251
## total.sulfur.dioxide density alcohol quality
## citric.acid 0.036 0.365 0.110 0.226
## fixed.acidity -0.113 0.668 -0.062 0.124
## volatile.acidity 0.076 0.022 -0.202 -0.391
## residual.sugar 0.203 0.355 0.042 0.014
## chlorides 0.047 0.201 -0.221 -0.129
## sulphates 0.043 0.149 0.094 0.251
## total.sulfur.dioxide 1.000 0.071 -0.206 -0.185
## density 0.071 1.000 -0.496 -0.175
## alcohol -0.206 -0.496 1.000 0.476
## quality -0.185 -0.175 0.476 1.000
Density appears to correlate with a number of variables. Fixed acidity is the most notable (r = 0.668). Alcohol, citric acid and residual sugar also appear to be related to density to some degree.
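A matrix like the one above can be produced with `cor()`, which computes Pearson correlations by default, rounded to three decimals. The sketch uses ten rows of four columns from the `str()` output, so its values will not match the full-data matrix.

```r
# First ten rows of four columns, from the str() output
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- round(cor(wine), 3)
cor_matrix
```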
Looking at a subset of the data, fixed acidity, residual sugar and density appear to have little or no impact on the quality of the wine. Alcohol, on the other hand, has a notable correlation with both density and quality. The next step will take a closer look at the relationships between quality and a few other variables like alcohol, residual sugar and volatile acidity.
As alcohol volume increases, the variance in quality decreases. The vertical lines represent the resolution of the measurement, which is rounded to one decimal place. The relationship between alcohol and quality appears to be linear.
The plot is scaled to exclude the top 1% of observed residual sugar values. Most of the wines have residual sugar between 1g and 3g / dm^3. Clearly there is no meaningful relationship here.
The observations have been scaled like before. The addition of jitter and transparency helps to highlight a clear negative correlation between volatile acidity and quality.
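A base R sketch of the two techniques described: dropping the top 1% of values, then plotting with jitter and transparency. The report may well have used a plotting package instead, and the alpha value and jitter defaults here are assumptions.

```r
# First ten rows of the two columns, from the str() output
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Keep observations at or below the 99th percentile of volatile acidity
cutoff <- quantile(volatile_acidity, 0.99)
keep   <- volatile_acidity <= cutoff

pdf(NULL)  # null device so the sketch runs non-interactively; omit in a session
plot(jitter(volatile_acidity[keep]), jitter(quality[keep]),
     col = adjustcolor("black", alpha.f = 0.3),  # semi-transparent points
     pch = 16,
     xlab = "Volatile acidity (g / dm^3)", ylab = "Quality")
dev.off()
```

Jitter spreads out the points that pile up on integer quality scores, and transparency makes dense regions appear darker.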
Quality correlates most strongly with alcohol volume, and shows a somewhat weaker negative correlation with volatile acidity.
As alcohol level increases, the variance in quality decreases. On the plot representing the relationship between quality and alcohol, the observations become fewer as the alcohol level increases with a noticeable lack of lower scores. The higher score frequency shows a slight increase. The relationship looks to be linear.
Citric acid levels show a decent positive correlation with fixed acidity (r = 0.672) and a negative correlation with volatile acidity (r = -0.552). Fixed acidity seems to be a factor in citric acid additions. The addition of citric acid, while providing enhanced flavour, may also have a curtailing effect on the microbes that produce volatile acids.
The quality of a wine is moderately and positively correlated with alcohol volume (r = 0.476), and negatively, though slightly less strongly, correlated with volatile acidity (r = -0.391). No other variable shows a comparable correlation with quality. Since these two variables are only weakly correlated with each other (r = -0.202), there is an opportunity to explore both when generating predictive models.
When volatile acidity increases, the median quality score decreases. It also appears that alcohol volumes of 12% or above are indicative of higher scores, which seem to be aided by low volatile acidity.
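The median quality per volatile acidity group can be computed with `tapply()`. In this sketch the group boundaries are assumed (quartiles from the summary above), and the data are the first ten rows from the `str()` output, so the result only illustrates the mechanics.

```r
# First ten rows of the two columns, from the str() output
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed bins for volatile acidity
va_level <- cut(volatile_acidity,
                breaks = c(0, 0.39, 0.64, Inf),
                labels = c("Low", "Medium", "High"))

# Median quality score within each volatile acidity group
median_quality <- tapply(quality, va_level, median)
median_quality
```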
Citric acid concentration appears to have no impact on the quality of the wine. With the addition of the alcohol variable, it is once again evident that higher alcohol is somewhat associated with higher quality.
If a linear predictive model for quality is to be built, other variables and their correlations with quality will need to be examined.
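No model was built in this report. Purely as a sketch of what a starting point might look like, the code below fits `lm()` with the two variables most correlated with quality. The choice of predictors is an assumption, and ten rows are far too few for the coefficients to mean anything.

```r
# First ten rows of three columns, from the str() output
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Ordinary least squares: quality as a linear function of both predictors
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
summary(fit)
```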
Fixed acidity appears to have no impact on quality. Volatile acidity in the presence of fixed acidity does not follow any identifiable pattern.
Citric acid levels show some correlation with fixed acidity, with higher levels measured as fixed acidity increases. Citric acid has no noticeable impact on quality in the presence of fixed acidity.
Wines with higher volatile acidity have lower median quality scores per alcohol volume. The variance appears quite constant across the groups with high levels showing the least variance between the 1st and 3rd quartiles.
Low levels of citric acid correspond with higher levels of volatile acidity.
Higher citric acid levels correspond with higher levels of fixed acidity but don’t show a meaningful relationship with quality.
The variance in fixed acidity grew as citric acid levels increased, with high levels of citric acid showing perhaps the biggest variance in terms of fixed acidity and quality.
No models were created.
The distribution of wine quality is approximately normal. Quality ratings greater than 5 are better represented than those below 5.
Wines with the highest levels of volatile acidity bring the median quality score down slightly. In the presence of volatile acidity, alcohol volume plays a less noticeable part in overall quality.
This plot shows the lack of clear and obvious causes for predicting quality. The distributions follow no clear pattern to help with establishing a predictive linear model.
The Vinho Verde red wine data set contains numeric measurements and sensory output on almost 1,600 Vinho Verde red wines across 12 variables from around 2009. This study started by looking at single variables in the data set, from which I took a deeper look using different plotting choices.
The relationship between quality score and alcohol volume showed some promise early on. I was a little surprised that sulfur dioxide levels had a negligible impact on quality.
With such a small variance in the output variable, building a predictive model may not be that useful. Even the strongest correlations were not particularly strong in this data set. The big takeaway from this examination is that alcohol volume has a noticeable impact, producing higher average scores, and that volatile acidity will bring the quality down. However, in the presence of other variables these observations fall short of being reliable. Perceived wine quality appears to be all about striking the right balance rather than executing an exact formulation.